Dummify the categorical variables
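A minimal sketch of what dummifying can look like with pandas, assuming one-hot indicator columns are what is wanted. The toy rows are invented for illustration and are not taken from the housing data.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are invented).
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "Foundation": ["PConc", "CBlock", "PConc"],
})

# One indicator column per category level. drop_first=True removes the
# first level (alphabetically) so the indicators are not redundant.
dummies = pd.get_dummies(df, columns=["Foundation"], drop_first=True, dtype=int)
print(dummies)
```

Numeric columns pass through unchanged; only the columns named in `columns` are expanded.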

We started with a single predictor, choosing a variable with a high absolute correlation coefficient with SalePrice among all attributes. GrLivArea: 0.708624

The R-squared score is too low, probably because the model uses too few variables.
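For a one-predictor least-squares fit with an intercept, the in-sample R-squared equals the square of the correlation coefficient, so a correlation of 0.708624 corresponds to an R-squared of about 0.502 on the data the correlation was computed from. The sketch below checks that identity on synthetic data; the numbers are generated, not the housing data, and the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1500, 500, size=200)             # synthetic living area
y = 100 * x + rng.normal(0, 40000, size=200)    # synthetic sale price

# Fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

# R-squared = 1 - residual sum of squares / total sum of squares.
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between predictor and target.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

This is why a single variable, even a well-correlated one, leaves roughly half of the variance unexplained.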

We treat a variable as potentially related to "SalePrice" when its correlation coefficient is above 0.3 or below -0.3, and we treat two variables as highly correlated when the coefficient is above 0.7 or below -0.7. Based on these assumptions, we keep the variables that meet the first condition, drop the others, and build a new dataframe. The selected variables and their correlation with SalePrice are:

LotFrontage: 0.351799
Alley: -0.534319
OverallQual: 0.790982
YearBuilt: 0.522897
YearRemodAdd: 0.507101
MasVnrType: 0.384602
MasVnrArea: 0.477493
ExterQual: 0.567079
Foundation: -0.441842
BsmtQual: 0.484765
BsmtFinSF1: 0.386420
TotalBsmtSF: 0.613581
HeatingQC: 0.333038
1stFlrSF: 0.605852
2ndFlrSF: 0.319334
GrLivArea: 0.708624
KitchenQual: 0.460517
TotRmsAbvGrd: 0.533723
Fireplaces: 0.466929
OpenPorchSF: 0.315856
PoolQC: 0.543811
WoodDeckSF: 0.324413
GarageFinish: 0.513105
GarageCars: 0.640409
GarageArea: 0.623431
GarageYrBlt: 0.486362
FullBath: 0.560664
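The threshold rule above can be sketched with pandas as follows. The toy frame and the `HouseAge` and `MiscVal` columns are invented for illustration; only the 0.3 cutoff comes from the text.

```python
import pandas as pd

# Toy numeric frame (values are invented, not the real housing data).
df = pd.DataFrame({
    "GrLivArea": [1200, 1500, 1400, 2000, 2300],
    "HouseAge":  [50, 42, 30, 25, 5],
    "MiscVal":   [2, 1, 3, 1, 2],
    "SalePrice": [100000, 200000, 300000, 400000, 500000],
})

# Correlation of every column with the target, excluding the target itself.
corr = df.corr()["SalePrice"].drop("SalePrice")

# Keep variables whose absolute correlation exceeds the 0.3 threshold.
selected = corr[corr.abs() > 0.3].index.tolist()

# New dataframe with the selected predictors plus the target.
reduced = df[selected + ["SalePrice"]]
print(corr)
print(selected)
```

Filtering on the absolute value handles the "above 0.3 or below -0.3" condition in one step.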

Drop the variables that have too many null values.
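One possible way to do this is to compute the share of missing values per column and keep only the columns below a cutoff. The 50% cutoff and the toy values below are assumptions for illustration; the text does not state which threshold was used.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (invented for illustration).
df = pd.DataFrame({
    "GrLivArea":   [1710, 1262, 1786, 1717],
    "PoolQC":      [np.nan, np.nan, np.nan, 3.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})

# Fraction of missing values in each column.
null_share = df.isna().mean()

# Assumed cutoff: drop columns with more than half of their values missing.
max_null_share = 0.5
kept = df.loc[:, null_share <= max_null_share]
print(null_share)
print(kept.columns.tolist())
```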

Build the multiple linear regression model
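A minimal sketch of a multiple linear regression fit, using plain NumPy least squares rather than whichever library the original model used. The data is synthetic, generated from known coefficients so the fit can be checked; `area` and `cars` are placeholder predictors.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
area = rng.normal(1500, 400, n)       # synthetic living area
cars = rng.integers(0, 4, n)          # synthetic garage capacity (0 to 3)
price = 20000 + 80 * area + 15000 * cars + rng.normal(0, 5000, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), area, cars])

# Ordinary least squares: coef = [intercept, slope_area, slope_cars].
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# In-sample goodness of fit.
pred = X @ coef
r_squared = 1 - np.sum((price - pred) ** 2) / np.sum((price - price.mean()) ** 2)
print(coef, r_squared)
```

With several informative predictors the R-squared rises well above what a single variable achieves.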

We found a house for sale in Iowa with the features below. Address: 2632 Pinto Ln, Iowa City, IA 52240 (our data is from the Iowa real-estate market)

6 rooms above ground (3 bedrooms, 3 bathrooms)
Built in 2012
Total number of fireplaces: 1
Attached garage spaces: 2
Total interior livable area: 1,591 sqft
Finished area below ground: 380
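To score this listing, its features have to be arranged as a one-row frame with the same columns, in the same order, as the training data. The sketch below shows that step only. The mapping from listing fields to column names (for example, finished area below ground to `BsmtFinSF1`) and the `train_columns` list are assumptions for illustration.

```python
import pandas as pd

# Listing features mapped to dataset column names (mapping is assumed).
listing = pd.DataFrame([{
    "TotRmsAbvGrd": 6,
    "YearBuilt": 2012,
    "Fireplaces": 1,
    "GarageCars": 2,
    "GrLivArea": 1591,
    "BsmtFinSF1": 380,
}])

# Hypothetical column order used when the model was trained.
train_columns = ["GrLivArea", "GarageCars", "YearBuilt",
                 "TotRmsAbvGrd", "Fireplaces", "BsmtFinSF1"]

# Align the listing to the training layout; a missing column would show up as NaN.
X_new = listing.reindex(columns=train_columns)
has_missing = bool(X_new.isna().any().any())
print(X_new)
print("missing features:", has_missing)
```

The resulting `X_new` row is what a fitted model would receive to produce a price estimate.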

The full details are in the listing: https://www.zillow.com/homedetails/2632-Pinto-Ln-Iowa-City-IA-52240/123317216_zpid/?


Rebuild and improve the model with a couple of related variables. Use OLS to see how quality impacts the sale price.

In the correlation matrix above, a coefficient above 0.7 is relatively high, and OverallQual (0.790982) exceeds it. We therefore assume that the quality of the exterior, basement, heating, kitchen, fireplace, and garage is correlated with the sale price.

The p-values of HeatingQC (0.793), FireplaceQu (0.181), and GarageQual (0.246) are too high (> 0.05), so these variables are not statistically significant and should be eliminated.

Improve the model
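The p-value screening described above can be sketched with the statsmodels formula API, assuming that is the OLS tool in use. The data is synthetic: the quality scores are random integers from 1 to 5, and the price is generated so that only ExterQual and KitchenQual matter.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "ExterQual": rng.integers(1, 6, n),
    "KitchenQual": rng.integers(1, 6, n),
    "HeatingQC": rng.integers(1, 6, n),   # has no effect on the synthetic price
})
df["SalePrice"] = (50000 + 30000 * df["ExterQual"] + 20000 * df["KitchenQual"]
                   + rng.normal(0, 20000, n))

# Full model with every quality variable.
full = smf.ols("SalePrice ~ ExterQual + KitchenQual + HeatingQC", data=df).fit()

# Keep predictors whose p-value is at or below 0.05 (the intercept is ignored).
pvalues = full.pvalues.drop("Intercept")
keep = pvalues[pvalues <= 0.05].index.tolist()

# Refit using only the retained predictors.
reduced = smf.ols("SalePrice ~ " + " + ".join(keep), data=df).fit()
print(pvalues)
print(keep, reduced.rsquared)
```

Removing a variable changes the p-values of the remaining ones, so it is usually worth checking the refitted summary again.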

The result is not as good as expected, so we suspect that interactions between variables affect the goodness of fit. We decided to add two interaction terms, ExterQual:GarageQual and KitchenQual:FireplaceQu, because exterior quality and garage quality are highly correlated (the better the exterior quality, the better the garage quality), and the same holds for kitchen quality and fireplace quality.

After adding these two interaction terms, we were surprised to find that the R-squared improved. The p-values of ExterQual:GarageQual and KitchenQual:FireplaceQu show that both interaction effects are statistically significant. Consequently, the effect of garage quality on the sale price depends on the exterior quality, and the effect of fireplace quality on the sale price depends on the kitchen quality. That's the "it depends" nature of an interaction effect.
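In a model of the form SalePrice = b0 + b1*ExterQual + b2*GarageQual + b3*ExterQual*GarageQual, the slope for ExterQual is b1 + b3*GarageQual, which is what "it depends" means here. The sketch below fits such a term with the statsmodels formula syntax `ExterQual:GarageQual`, assuming that API. The data is synthetic and built with a true interaction of 8000.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "ExterQual": rng.integers(1, 6, n),
    "GarageQual": rng.integers(1, 6, n),
})
# Synthetic price with main effects and a product (interaction) effect.
df["SalePrice"] = (40000 + 10000 * df["ExterQual"] + 5000 * df["GarageQual"]
                   + 8000 * df["ExterQual"] * df["GarageQual"]
                   + rng.normal(0, 15000, n))

# Model without and with the interaction term.
main = smf.ols("SalePrice ~ ExterQual + GarageQual", data=df).fit()
inter = smf.ols("SalePrice ~ ExterQual + GarageQual + ExterQual:GarageQual",
                data=df).fit()

print(main.rsquared, inter.rsquared)
print(inter.params["ExterQual:GarageQual"],
      inter.pvalues["ExterQual:GarageQual"])
```

In-sample R-squared never decreases when a term is added, so the p-value of the interaction term is the more informative signal.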

The R-squared is 0.556, which is even worse. So we decided to build the model with the square footage of the house and see what happens.

The coefficient shows that total square footage is strongly related to the sale price.

The result is quite good. We improved the goodness of fit by building several models.